This document walks through the CAST workflow for probabilistic decline curve analysis (PDCA): fitting Arps-style decline models to historical production with Monte Carlo sampling, scoring the fits with loss functions, and extending toward Markov Chain Monte Carlo methods.

Some background is needed first. We want a distribution of forecasts, but we don't have a population of forecasts to build one from; this is where Bayesian theory plays a part. We can estimate the posterior distribution of the decline parameters by randomly sampling from a prior distribution and retaining the samples that best explain the observed production.

The included sample data contains two production streams, oil and gas, for a small group of wells. We will plot both streams, then pick a single well to carry through the workflow. The uwi.matrix table summarizes the important parameters from each well.
Key parameters for each production dataset have been summarized and compiled into a table. Cumulative months on production, initial oil and gas rates, peak oil and gas rates, peak month, and cumulative volumes can be seen for each UWI. These parameters will be used in subsequent calculations and help us prioritize wells as we later begin learning from wells with longer production histories in the PDCA process.
| UWI | OnProd Date | Initial Oil Rate | Peak Oil Rate | Peak Month | Cum Oil |
|---|---|---|---|---|---|
| 100/05-17-081-16W6/00 | 2018-04-01 | 26.123677 | 328.6216 | 3 | 62958.48 |
| 100/11-17-081-16W6/00 | 2018-05-01 | 219.027416 | 544.5717 | 2 | 95700.09 |
| 100/12-17-081-16W6/00 | 2018-05-01 | 265.308244 | 538.2610 | 2 | 82928.63 |
| 102/05-17-081-16W6/00 | 2018-04-01 | 6.646233 | 406.3637 | 3 | 39977.40 |
| 102/12-17-081-16W6/00 | 2018-05-01 | 135.210625 | 439.8464 | 2 | 60875.30 |
| 103/05-17-081-16W6/00 | 2018-04-01 | 7.631636 | 378.4579 | 3 | 58147.41 |
Arps decline curves are a common methodology for forecasting well production. We'll define the modified hyperbolic form of the equation in executable form: the function will accept decline parameters and a time series across which it can evaluate production rate. First we'll need a function to convert secant (effective) decline percentages into nominal form.
## to.nominal function converts effective decline rates to nominal decline rates
to.nominal<-function(De,b=0){
if(b==0){
a <- -log(1-De)
} else if(b==1){
a <- De/(1-De)
} else{
a <- (1/b) * (((1-De)^(-b))-1)
}
return(a)
}
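As a quick check on the three branches (the function is restated here so the snippet runs standalone), the conversions can be verified by hand:

```r
# Restated from above so this check runs standalone
to.nominal <- function(De, b = 0){
  if(b == 0){
    a <- -log(1 - De)
  } else if(b == 1){
    a <- De/(1 - De)
  } else{
    a <- (1/b) * (((1 - De)^(-b)) - 1)
  }
  return(a)
}

to.nominal(0.5)           # exponential branch: -log(0.5) ~ 0.693
to.nominal(0.5, b = 1)    # harmonic branch: 0.5/0.5 = 1
to.nominal(0.5, b = 0.5)  # hyperbolic branch: 2*(sqrt(2) - 1) ~ 0.828
```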
## MH function is the modified hyperbolic function, where a single b value transitions
## to an exponential decline at a limiting decline rate
# Function will accept equation constants and a vector of elapsed times in days
# Output will be the rate at each given time for the given parameters
MH<-function(qi, b = 1.5, Di = 0.7, Dlim = 0.07, q.abd = 3, t = c("Enter In Days")){
#qi must be entered as a rate
#Di must be decimal in effective secant format as % year
#t must be entered as days
#Convert Di and Dlim to nominal
ai <- to.nominal(Di)
alim <- to.nominal(Dlim)
#Convert t to yearly time fraction
t <- t/365.25
#TODO: add an error handler for initial parameters that bypass the exponential
# tail entirely, i.e. error if Di < Dlim
if(b==0){
#Exponential decline
q<-qi*exp(-ai*t)
} else{
#Calculate limiting rate at which decline transitions to exponential
qlim <- qi*(alim/ai)^(1/b)
#Calculate limiting decline time, in years
tlim <- (((qi/qlim)^b)-1)/(b*ai)
#Calculate q assuming hyperbolic
q<-qi/((1+b*ai*t)^(1/b))
#Replace q with the exponential tail when t > tlim
q[t > tlim] <- qlim*exp(-alim*(t[t>tlim] - tlim))
}
#Abandonment cutoff applied to both branches. When q < q.abd, q = 0
q[q < q.abd] <- 0
return(q)
}

Probabilistic decline curve analysis (PDCA) applies decline models like the one above in a probabilistic framework: instead of a single best-fit forecast, we produce a distribution of forecasts that quantifies uncertainty. In Bayesian terms, our assumed ranges for the decline parameters form the prior distribution, and the parameter sets that survive comparison against the actual production data approximate the posterior.
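Before moving on, a quick sanity check on the MH function defined above can build confidence in the implementation. Both functions are restated so the snippet runs standalone (note: in this restatement the abandonment cutoff is applied to both branches):

```r
# Restated from the document so the check runs standalone
to.nominal <- function(De, b = 0){
  if(b == 0) return(-log(1 - De))
  if(b == 1) return(De/(1 - De))
  (1/b) * (((1 - De)^(-b)) - 1)
}

MH <- function(qi, b = 1.5, Di = 0.7, Dlim = 0.07, q.abd = 3, t){
  ai <- to.nominal(Di)        # secant to nominal, exponential form
  alim <- to.nominal(Dlim)
  t <- t/365.25               # days to yearly time fraction
  if(b == 0){
    q <- qi*exp(-ai*t)
  } else {
    qlim <- qi*(alim/ai)^(1/b)                 # rate at transition
    tlim <- (((qi/qlim)^b) - 1)/(b*ai)         # time of transition, years
    q <- qi/((1 + b*ai*t)^(1/b))               # hyperbolic segment
    q[t > tlim] <- qlim*exp(-alim*(t[t > tlim] - tlim))  # exponential tail
  }
  q[q < q.abd] <- 0           # abandonment cutoff
  q
}

t.days <- c(0:359)*30.4       # 30 years of monthly steps
q <- MH(qi = 500, b = 1.2, Di = 0.7, Dlim = 0.07, q.abd = 3, t = t.days)
q[1]                          # rate at t = 0 equals qi: 500
all(diff(q[q > 0]) < 0)       # rates decline monotonically: TRUE
```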
The Monte Carlo method is our first attempt at estimating that posterior distribution: the parameter distributions below define our prior, and each parameter is sampled from it independently. We'll set up the Monte Carlo inputs, build any functions required, and run the sampling, then show histograms of the samples for b, di, and qi based on the input criteria.
# Define distribution constraints for our Monte Carlo sampling of arps parameters
# Only oil values will be defined here.
## bi values
o.bi_min <- 0.5 #Initial oil b value min value
o.bi_max <- 2 #Initial oil b value max value
## di values; as decimals in secant. Will be converted to nominal in arps equation.
o.di_min <- 0.4 #Initial oil decline percentage min
o.di_max <- 0.98 #Initial oil decline percentage max
## qi multiplier or absolute values; Multipliers will be applied to peak intial rates.
o.qi_min <- 10 #Absolute min oil rate (bbls/d) for qi
o.qi_max <- 3 #Peak oil rate multiplier to define max oil qi (3 * peak_rate)
## dlim values; as decimals in secant.
o.dlim <- 0.07 #Limiting decline percentage used in modified hyperbolic equation
# Abandonment values; as rates in production units
o.abd <- 3
g.abd <- 10
# To handle multiple distribution types, let's define a "DefineDistribution" class
# The class will contain distribution information in a list, defined in S4
# Type = Normal, Lognormal, or Uniform; min_mean = min bound for uniform or mean of lognormal
# or normal; max_sd = max bound for uniform or standard deviation for lognormal or normal
setClass("DefineDistribution",slots = list(type="character",min_mean="numeric",max_sd="numeric"))
# Using this class, we will define uniform distributions for all of our inputs
b <- new("DefineDistribution",type = "Uniform",min_mean = o.bi_min,max_sd = o.bi_max)
di <- new("DefineDistribution",type = "Uniform",min_mean = o.di_min,max_sd = o.di_max)
dlim <- new("DefineDistribution",type = "Uniform",min_mean = o.dlim,max_sd = o.dlim)
#Let's pick the first well 100/05-17-081-16W6/00 for a peak rate to show the distribution of qi
o.qpeak <- matrix[1,"Oil.Peak.Rate"]
#Then define its distribution
qi <- new("DefineDistribution",type = "Uniform",min_mean = o.qi_min, max_sd = o.qpeak*o.qi_max)

Let's visualize our defined distributions and overlay our samples from them. We'll use a function named sample.any, defined in the background, that accepts a DefineDistribution class object and randomly samples based on its type and inputs.
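sample.any itself lives in a background file that isn't shown here; as a rough sketch, assuming it simply dispatches on the type slot (the class definition is restated so the snippet runs standalone, and this reconstruction is hypothetical):

```r
library(methods)

# Restated class definition from the document
setClass("DefineDistribution",
         slots = list(type = "character", min_mean = "numeric", max_sd = "numeric"))

# Hypothetical reconstruction of sample.any: dispatch on the distribution type,
# reading min_mean/max_sd according to the class's dual-purpose convention
sample.any <- function(n, distribution){
  switch(distribution@type,
         "Uniform"   = runif(n, min = distribution@min_mean, max = distribution@max_sd),
         "Normal"    = rnorm(n, mean = distribution@min_mean, sd = distribution@max_sd),
         "Lognormal" = rlnorm(n, meanlog = distribution@min_mean, sdlog = distribution@max_sd),
         stop("Unknown distribution type: ", distribution@type))
}

set.seed(42)
b <- new("DefineDistribution", type = "Uniform", min_mean = 0.5, max_sd = 2)
b.sample <- sample.any(10000, b)
range(b.sample)  # stays within [0.5, 2]
```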
# Sample distributions using custom function sample.any
b.sample <- sample.any(n=10000,distribution = b)
di.sample <- sample.any(n=10000,distribution = di)
qi.sample <- sample.any(n=10000, distribution = qi)
# Note here Dlim is constant at 7% so we won't show its distribution
# Display distributions using ggplot2 histograms
ggplot(data = data.frame(b = b.sample), aes(x = b)) +
geom_histogram(aes(y= ..density..), colour = "black", fill = "grey", alpha = .5) +
geom_density(alpha=.2, fill="#763F98") +
theme_classic() +
theme(text = element_text(size = rel(5)))
ggplot(data = data.frame(di = di.sample), aes(x = di)) +
geom_histogram(aes(y= ..density..), colour = "black", fill = "grey",alpha=.5) +
geom_density(alpha=.2, fill="#763F98") +
theme_classic() +
theme(text = element_text(size = rel(5)))
ggplot(data = data.frame(qi = qi.sample), aes(x = qi)) +
geom_histogram(aes(y= ..density..), colour = "black", fill = "grey",alpha=.5) +
geom_density(alpha=.2, fill="#763F98") +
theme_classic() +
theme(text = element_text(size = rel(5)))

Uniform Distributions of b, di and qi
With our priors sampled, the next step is forecast synthesis: substituting each sampled parameter set into the modified hyperbolic equation to generate a candidate rate profile over the well's producing history.
Let’s combine and organize our sampled parameters in a data frame. We’ll use an equation-specific function to ensure completeness and repeatability for multi-well matching.
# The MHSample function will sample and combine all parameters based on their
# DefineDistribution classes. (A method input may be added later to return different
# output types and speed up MCMC calculations; ignore that for now.)
MHSample<-function(n=10000,qi,b,di,dlim){
#Sample relevant parameters for MH function: qi, b, di, dlim
#Inputs of class distribution created prior to function call, containing distribution information
# for each parameter
#Function will output a Monte Carlo matrix with n x m dimensions, where n is sample number
# and m is the number of parameters needed for MH equation
#Step through each parameter and call sample.any function
qi.out <- sample.any(n,qi)
b.out <- sample.any(n,b)
di.out <- sample.any(n,di)
dlim.out <- sample.any(n,dlim)
#Combine each returned column into dataframe for output
sampledParams <- data.frame("qi"=qi.out,"b"=b.out,"di"=di.out,"dlim"=dlim.out)
#Return sampled dataframe
return(sampledParams)
}
# Pass the previously defined distributions into the MHSample function
# Note this repeats our earlier sampling in a cleaner form
MH.sampled.parameters <- MHSample(n = 10000,qi = qi,b = b,di = di,dlim = dlim)
# Return in table, displaying first 10 rows only
kable(MH.sampled.parameters[c(1:10),], align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F, position = "center")

| qi | b | di | dlim |
|---|---|---|---|
| 566.07793 | 1.8853023 | 0.4733630 | 0.07 |
| 91.85886 | 0.7974686 | 0.9570567 | 0.07 |
| 408.19916 | 1.0687344 | 0.8342801 | 0.07 |
| 632.53241 | 0.5597691 | 0.8313838 | 0.07 |
| 779.20478 | 1.0869983 | 0.5657342 | 0.07 |
| 414.54693 | 1.0495356 | 0.9655378 | 0.07 |
| 373.23016 | 1.8459167 | 0.8551814 | 0.07 |
| 186.64516 | 1.5416893 | 0.9038671 | 0.07 |
| 216.14082 | 1.0059336 | 0.4922665 | 0.07 |
| 264.73093 | 1.2307099 | 0.5992085 | 0.07 |
Using the combined parameter dataframe and the modified hyperbolic equation, we can define a function to use each row of the parameter dataframe as a Monte Carlo iteration and substitute it into the arps equation across the time interval of the current well. The function will create and return a mcIteration object for each iteration. This object is a list with slots for Iteration name, Parameters, and Synth. Some additional slots will be created and populated later as we approach cost and forecasting.
# A function, MH.synthesize, will be defined to carry out rate synthesis across the actuals
# time interval.
MH.synthesize<-function(parameters,q.abd,t){
#Parse up parameters and pass into MH function call
fc <- MH(parameters[["qi"]],parameters[["b"]],parameters[["di"]],parameters[["dlim"]],q.abd,t)
#Simplify parameters into a single string representation
parametersName<-paste(names(parameters),round(parameters,digits=2),sep = "=",collapse=" ")
#Create mcIteration imitation list to improve performance and ease of combining data
#Return mcIteration. Remaining list items to be appended by later functions
return(list("Iteration" = parametersName,"Parameters" = parameters,"Synth"=fc))
}
# Define a time interval of actuals to synthesize across, using 100/05-17-081-16W6/00
uwi<-"100/05-17-081-16W6/00"
# Subset production including only 100/05-17-081-16W6/00
singleWell.all <- formated_production$CD[which(formated_production$CD$Unique.Well.ID ==
uwi),]
# Find peak oil month for this uwi
uwi.pm <- matrix[["Oil.Peak.Month"]][which(matrix$eachUWI==uwi)]
# Refine production datatable to only include data after peak oil rate
singleWell<-singleWell.all[-c(1:(uwi.pm-1)),]
# Create time vector starting at 0 for peak oil month
synth.time <- (singleWell$CD.Month - uwi.pm) * 30.4 # Converted to daily
# Let's apply the MH.synthesize function in an apply call to execute across each row
# of the parameters dataframe
oil.synth<-apply(MH.sampled.parameters,1,MH.synthesize,q.abd = o.abd, t = synth.time)
# Print first object of oil.synth to show mcIteration object
print(oil.synth[[1]])

## $Iteration
## [1] "qi=566.08 b=1.89 di=0.47 dlim=0.07"
##
## $Parameters
## qi b di dlim
## 566.077933 1.885302 0.473363 0.070000
##
## $Synth
## [1] 566.0779 538.0106 513.6163 492.1628 473.1066 456.0348 440.6274 426.6323
## [9] 413.8474 402.1089 391.2825 381.2565 371.9376 363.2471 355.1178 347.4921
## [17] 340.3206 333.5602 327.1732
Each mcIteration list, defining a Monte Carlo iteration and historical synthesis, can now be evaluated against the actual data to search for an appropriate model.
How do we compare the synthesis forecasts against our actuals and determine what is a good fit and what is a poor fit?
To compare each synthesis to the historical production data observed, we need functions that compare predicted points to actuals for each model. These functions are commonly referred to as loss functions. The most common is the L2 loss, popular for its easy implementation, especially in linear models. However, the L2 loss is less useful when comparing data with significant outliers, such as oil and gas production data, because it assigns too much weight to those outliers and rejects models that do not match them. We will need more robust loss functions to find models that match the historical data the way an engineer would.
Soft L1 loss is a robust, continuous variation of the common Huber loss that gives less weight to outliers than an L2 loss, making it useful for fitting time-series production data. Our Soft L1 definition will return the individual loss of each point and an average cost over all input residuals.
soft_L1<-function(r){
IndvCost <- 2*((1 + abs(r))^0.5 - 1)
Cost <- sum(IndvCost,na.rm = TRUE) / length(IndvCost)
return(list(IndvCost,Cost))
}
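To see concretely how the Soft L1 down-weights outliers relative to an L2 loss (the function is restated so the comparison runs standalone):

```r
# Restated from above so the comparison runs standalone
soft_L1 <- function(r){
  IndvCost <- 2*((1 + abs(r))^0.5 - 1)
  Cost <- sum(IndvCost, na.rm = TRUE) / length(IndvCost)
  return(list(IndvCost, Cost))
}

r <- c(0.1, 0.2, -0.1, 10)   # residuals with one large outlier
l2  <- r^2                   # L2 point cost
sl1 <- soft_L1(r)[[1]]       # Soft L1 point cost
l2[4]   # 100 -- the outlier dominates an L2 fit
sl1[4]  # ~4.63 -- the outlier still costs most, but with far less weight
```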
To apply the loss equation we’ll define a function to apply the Soft L1 loss to the residuals of each model match. The function will include an option to apply the loss function against the log residuals, an option that we’ll take advantage of due to the non-linear nature of production data. The function will populate the IndvCost and Cost slots of each mcIteration object, and once the function is defined, we can use these slots to visualize each model’s fit.
## applyLoss function will be used to accept a list of mcIteration objects, an actuals dataframe,
# a loss function to apply, and an option to use log residuals or absolute residuals,
# then populate the cost and avg_cost slots of the mcIteration objects and return them.
# Bestfit will contain c(eps_mean_min,eps_std_min) for pdca
applyLoss<-function(iterationObject = c("mcIteration List"),
actuals = c("Matrix with actual data"), loss.function = soft_L1,
log.residuals = TRUE){
# Calculate residuals from model synthesis to actuals. If log.residuals is TRUE,
# return the log of the residuals
if (log.residuals == TRUE){
r <- log(actuals) - log(iterationObject$Synth)
} else {
r <- actuals - iterationObject$Synth
}
# Apply loss function to residuals. Returns a list of IndvCost and Cost.
rho <- loss.function(r)
# Append IndvCost and avg_cost (named Cost) to the mcIteration imitation list,
# pulling from the rho list slots
iterationObject$IndvCost <- rho[[1]]
iterationObject$Cost <- rho[[2]]
# return iterationObject
return(iterationObject)
}
# Use the applyLoss function in an lapply call across each object in oil.synth
oil.synth <- lapply(oil.synth, applyLoss, actuals = singleWell[["Oil.Rate"]],
loss.function = soft_L1, log.residuals = FALSE)
# Repeat for log residuals
oil.synth.log <- lapply(oil.synth, applyLoss, actuals = singleWell[["Oil.Rate"]],
loss.function = soft_L1, log.residuals = TRUE)
# Updated mcIteration object
print(oil.synth[[1]])

## $Iteration
## [1] "qi=566.08 b=1.89 di=0.47 dlim=0.07"
##
## $Parameters
## qi b di dlim
## 566.077933 1.885302 0.473363 0.070000
##
## $Synth
## [1] 566.0779 538.0106 513.6163 492.1628 473.1066 456.0348 440.6274 426.6323
## [9] 413.8474 402.1089 391.2825 381.2565 371.9376 363.2471 355.1178 347.4921
## [17] 340.3206 333.5602 327.1732
##
## $IndvCost
## [1] 28.88406 32.56231 34.73612 35.90350 35.83259 35.68141 34.42292 34.88264
## [9] 34.77318 34.58031 34.41093 34.43722 33.83672 33.68369 33.30451 32.98297
## [17] 32.87979 32.56150 32.27778
##
## $Cost
## [1] 33.82285
With cost calculated using the Soft L1 loss function applied to both the standard and log residuals for every model iteration, let's use the average cost to find our "best fit" model out of the 10,000 iterations. We'll plot the model against the actual oil rate, coloured by individual cost for both residual methods, to see how each handles outliers.
# Use lapply and do.call to combine the average cost of every model object
# into a single dataframe
oil.cost <- do.call(rbind,lapply(oil.synth,function(x){
return(x$Cost)
}))
# Repeat for log residuals
oil.cost.log <- do.call(rbind,lapply(oil.synth.log,function(x){
return(x$Cost)
}))
# Find index of best fit model using log soft_L1 residuals
oil.bf.log <- which.min(oil.cost.log)
# Plot this "best fit" against actuals and colour by standard and log residual cost
# To plot, we'll utilize some custom plotly functions defined in plotResources external file
oil.synth.plot <- plotly.multimodel.indvcostcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth[oil.bf.log],
scales = "semi-log")
oil.synth.plot <- layout(oil.synth.plot, xaxis = list(title = "Cal-Day Month"),
yaxis = list(title = "Oil Rate (bbls/d)"),
title = "Soft L1 Cost of Residuals")
oil.synth.plot.log <- plotly.multimodel.indvcostcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth.log[oil.bf.log],
scales = "semi-log")
oil.synth.plot.log <- layout(oil.synth.plot.log, xaxis = list(title = "Cal-Day Month"),
yaxis = list(title = "Oil Rate (bbls/d)"),
title = "Soft L1 Cost of Log Residuals")
# Combine into a subplot
combined.cost <- subplot(oil.synth.plot,oil.synth.plot.log,nrows = 1,titleX = TRUE,titleY=FALSE)
combined.cost <- layout(combined.cost,title = "Soft L1 Cost of Absolute vs. Log Residuals",
yaxis=list(title="Oil Rate (bbls/d)"))
#IMPROVE TOOLTIP???
hide_colorbar(combined.cost)

Comparing the absolute residual cost in the left plot against the log residual cost in the right plot, there is a similar relative spread in the residuals and both cost the peak rate similarly, but the log residuals do a better job of costing the sub-peak at month 9. We will carry log residual cost through the remainder of the CAST methodology as it captures the non-linearity of the production data better. One important thing to notice before moving on is that the Soft L1 function gives a high cost to the peak oil rate and, as a result, does not match it perfectly. This may be concerning when compared to traditional DCA matches, but given that early-time production data can be far from steady-state flowing behaviour, it is a useful feature to maintain.
Using the log residuals and the Soft L1 cost function, let’s find the top 100 models from our initial 10,000 iterations and see how they describe the ranges of our actual oil rates. We’ll colour them by average cost to compare the forecasts.
# Find and plot top 100 (top 1%) models using the ggplot resource. Colour by cost.
oil.synth.P1 <- which(oil.cost.log <= quantile(oil.cost.log,0.01))
#Sort from highest cost to lowest for plotting visuals
oil.synth.P1 <- oil.synth.P1[order(oil.cost.log[oil.synth.P1],decreasing = TRUE)]
oil.plot.P1 <- ggplot.multimodel.costcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth.log[oil.synth.P1],
scales = "semi-log")
# Add titles and output
oil.plot.P1 <- oil.plot.P1 +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)")
oil.plot.P1

Top 1% of Models Using Soft L1 Loss
We can see that the top 100 forecasts do a good job of covering the range of actual oil rates, with the lowest-cost forecasts tightly grouped around the actuals. We've shown we can use Monte Carlo sampling paired with a loss function to match historical production! Before moving on, note that the majority of the forecasts under-predict the initial rate. This is primarily driven by the high cost of the peak oil rate under the Soft L1, but could also be improved through our input distributions. This is something we can improve upon later.
The Soft L1 loss function does a decent job of finding a best-fit model, but is there a better loss function? SPE-174784-PA by Fulford et al. (reference link to be added) defines a more complete loss function for probabilistic decline curve analysis. It takes a similar approach, but rather than averaging point costs, it scores each model on the mean and standard deviation of its residuals relative to the best achievable values, rewarding models that are both unbiased and tightly scattered around the actuals.
# As with our Soft L1, pdca_loss will return a list with slots for IndvCost and Cost
pdca_loss<-function(r, Eps_mean_min = 0, Eps_std_min = 0){
IndvCost <- r^2 # pdca loss simply uses L2 loss for individual cost
Eps_mean <- mean(r,na.rm = TRUE) #Mean of residuals
Eps_std <- sd(r,na.rm=TRUE) #Std of residuals
Cost <- ((Eps_mean - Eps_mean_min)/.1)^2+((Eps_std - Eps_std_min)/.01)^2
return(list(IndvCost,Cost))
}

Just as we did with the Soft L1, let's use the PDCA loss function to find the top 100 forecasts from the same 10,000 iterations. We'll set the minimum mean and minimum standard deviation of epsilon to 0, assuming a perfect model is achievable. Again we'll colour by average cost and compare to the Soft L1 method.
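To illustrate what this cost rewards (the function is restated so the example runs standalone): an unbiased, low-scatter residual vector costs far less than a consistently biased one, even though the biased fit has zero scatter:

```r
# Restated from above so the example runs standalone
pdca_loss <- function(r, Eps_mean_min = 0, Eps_std_min = 0){
  IndvCost <- r^2                  # L2 for individual point cost
  Eps_mean <- mean(r, na.rm = TRUE)
  Eps_std  <- sd(r, na.rm = TRUE)
  Cost <- ((Eps_mean - Eps_mean_min)/.1)^2 + ((Eps_std - Eps_std_min)/.01)^2
  return(list(IndvCost, Cost))
}

r.unbiased <- c(0.01, -0.01, 0.005, -0.005)  # centred on zero, low scatter
r.biased   <- c(0.3, 0.3, 0.3, 0.3)          # consistent over-prediction
pdca_loss(r.unbiased)[[2]]  # small cost: near-zero mean, small sd
pdca_loss(r.biased)[[2]]    # (0.3/0.1)^2 = 9: penalized despite zero scatter
```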
# Let's modify the applyLoss function slightly to accept the pdca bestfit parameters
# eps_min input will contain c(eps_mean_min,eps_std_min) for pdcaLoss
applyLoss<-function(iterationObject = c("mcIteration List"),
actuals = c("Matrix with actual data"), loss.function = soft_L1,
log.residuals = TRUE,
eps_min = c(0,0)){
# Calculate residuals from model synthesis to actuals. If log.residuals is TRUE,
# return the log of the residuals
if (log.residuals == TRUE){
r <- log(actuals) - log(iterationObject$Synth)
} else {
r <- actuals - iterationObject$Synth
}
# Apply loss function to residuals. Returns a list of IndvCost and Cost.
# Since pdca_loss is used, pass bestfit. Default to 0,0 as ideal bestfit.
rho <- loss.function(r,eps_min[1],eps_min[2])
# Append IndvCost and avg_cost (named Cost) to the mcIteration imitation list,
# pulling from the rho list slots
iterationObject$IndvCost <- rho[[1]]
iterationObject$Cost <- rho[[2]]
# return iterationObject
return(iterationObject)
}
# Apply PDCA loss function
oil.synth.PDCA <- lapply(oil.synth, applyLoss, actuals = singleWell[["Oil.Rate"]],
loss.function = pdca_loss, log.residuals = TRUE)
# Use lapply and do.call to combine each object loss into a single dataframe
oil.cost.PDCA <- do.call(rbind,lapply(oil.synth.PDCA,function(x){
return(x$Cost)
}))
# Find and plot top 100 (top 1%) models using the ggplot resource. Colour by PDCA cost.
oil.synth.pdcaP1 <- which(oil.cost.PDCA <= quantile(oil.cost.PDCA,0.01,na.rm = TRUE))
#Sort from highest cost to lowest for plotting visuals
oil.synth.pdcaP1 <- oil.synth.pdcaP1[order(oil.cost.PDCA[oil.synth.pdcaP1],decreasing = TRUE)]
oil.plot.pdcaP1 <- ggplot.multimodel.costcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.synth.PDCA[oil.synth.pdcaP1],
scales = "semi-log")
# Add titles and output
oil.plot.pdcaP1 <- oil.plot.pdcaP1 + ggtitle("PDCA Loss Function") +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)")
oil.plot.P1 + ggtitle("Soft L1 Loss Function")
oil.plot.pdcaP1

Comparing Top 1% of Models Using Soft L1 and PDCA Loss
Before we move on to forecasting, let’s look at the parameter distributions for the top 1% of models to get an idea of the solution space using the pdca loss method.
# Combine matched parameters into single data frame and plot using three histograms (1 for qi,
# 1 for b, and 1 for di). This will give us a good representation of the solution space.
# Combine P1 parameters into a dataframe
acceptedParameters <- do.call(rbind,lapply(oil.synth.PDCA[oil.synth.pdcaP1],function(x){
return(cbind(x$Parameters[1],x$Parameters[2],x$Parameters[3]))
}))
colnames(acceptedParameters)<-c("qi","b","di")
rownames(acceptedParameters)<-c()
acceptedParameters<-as.data.frame(acceptedParameters)
# Display distributions using ggplot2 histograms
ggplot(data = acceptedParameters, aes(x = b)) +
geom_histogram(aes(y= ..density..), colour = "black", fill = "grey", alpha = .5) +
geom_density(alpha=.2, fill="#763F98") +
theme_classic() +
theme(text = element_text(size = rel(5)))
ggplot(data = acceptedParameters, aes(x = di)) +
geom_histogram(aes(y= ..density..), colour = "black", fill = "grey",alpha=.5) +
geom_density(alpha=.2, fill="#763F98") +
theme_classic() +
theme(text = element_text(size = rel(5)))
ggplot(data = acceptedParameters, aes(x = qi)) +
geom_histogram(aes(y= ..density..), colour = "black", fill = "grey",alpha=.5) +
geom_density(alpha=.2, fill="#763F98") +
theme_classic() +
theme(text = element_text(size = rel(5)))

Parameters of P1 model matches using PDCA Loss method
The accepted parameter distributions are no longer uniform; they are closer to lognormal in shape. This suggests a single Monte Carlo pass over uniform priors may not be the most efficient way to characterize the posterior, and motivates stepping into MCMC later to build a proper PDCA distribution. Keep these shapes in mind for comparison later on.

Comparing the two loss functions' charts, we will carry the PDCA loss forward through the rest of the evaluation, as it provides a better probabilistic representation of the matches. With the PDCA loss selected as our go-forward loss function, let's see how its best fit performs and build forecasts from the top models.
Let’s forecast out the best fit model from our PDCA loss function to take a look at its long term projection.
# Define a function similar to MH.synthesis that will accept mcIteration objects,
# parse out each iteration parameters, and synthesize the full forecast using the
# MH arps model. It will populate a Forecast slot in mcIteration and return.
MH.forecast <- function(iterationObject = c("mcIteration List"),q.abd = 3,
t.forecast=c(0:359)*30.4){
# Parse up iteration parameters and pass into MH function call
parms <- iterationObject$Parameters
fc <- MH(parms[["qi"]],parms[["b"]],parms[["di"]],parms[["dlim"]],q.abd,t.forecast)
# Add forecast return to forecast slot of mcIteration
iterationObject$Forecast <- fc
# Return populated mcIteration object
return(iterationObject)
}
# Let's call the MH.forecast function using lapply to apply over the P1 forecasts
# Define forecast time
t.forecast <- c(0:359) # 30 years
# Execute call
oil.forecast <- lapply(oil.synth.PDCA[oil.synth.pdcaP1], MH.forecast,q.abd = o.abd,
t.forecast = t.forecast*30.4)
# Plot results
oil.plot.forecast <- ggplot.forecast.costcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.forecast,
forecast.x = t.forecast + 3,
scales = "semi-log")
# Add titles and output as interactive
oil.plot.forecast <- oil.plot.forecast +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)") +
xlim(0,300)
ggplotly(oil.plot.forecast)
We can see the forecast distribution appears to summarize our actual data well. Let’s compile the forecasts into P10, Mean, P50, and P90 curves to represent all the forecasts. We will need to define a function to accept a list of objects and return a dataframe with the consolidated forecasts, then pass it into a custom ggplot function to plot against our actuals and accepted forecasts.
## consolidate.rates will be used to accept a list of mcIteration objects with forecasts,
# combine the forecasts into a dataframe to perform probabilistic calcs, and output a single
# dataframe with x, P10, Mean, P50, and P90 consolidations. Only input accepted objects.
consolidate.rates <- function(iterationObjects = c("List of mcIteration objects with forecasts
to consolidate"),
forecast.x = c("dataframe containing x values used in forecasts"),
q.abd = o.abd,trim = c("Y / N")){
# For each iterationObject, use an lapply and do.call to combine into a single dataframe
consolidation.df <- do.call(rbind, lapply(iterationObjects , function(itObject){
return(itObject$Forecast)
}))
# Perform statistical calcs across each column for each timestep
mean <- colMeans(consolidation.df,na.rm = TRUE)
P90 <- apply(consolidation.df,2,quantile,.1,na.rm=TRUE)
P50 <- colMedians(consolidation.df,na.rm=TRUE)
P10 <- apply(consolidation.df,2,quantile,.9,na.rm=TRUE)
# Cut off forecasts at abandonment rate
mean[mean<q.abd]<-0
P90[P90<q.abd]<-0
P50[P50<q.abd]<-0
P10[P10<q.abd]<-0
# Combine into single dataframe with forecast.x
consolidation <- data.frame("x" = forecast.x, "P10" = P10, "Mean" = mean,
"P50" = P50, "P90" = P90)
# Trim length to remove rows with all zeros
if(trim == "Y"){
zero.check <- rowSums(consolidation[,2:5], na.rm = TRUE)
# Subsetting with != 0 is safe even when no rows are all zero
consolidation <- consolidation[zero.check != 0,]
}
# Return trimmed result
return(consolidation)
}
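The per-timestep aggregation at the heart of consolidate.rates can be illustrated on a toy forecast matrix; here base apply and median stand in for the matrixStats::colMedians call used above:

```r
# Three toy "forecasts" (rows) across five timesteps (columns)
forecasts <- rbind(c(100,  80,  60,  40, 20),
                   c(200, 150, 100,  50, 25),
                   c(300, 240, 180, 120, 60))
# Percentiles computed independently at each timestep (column)
P90  <- apply(forecasts, 2, quantile, 0.1)  # low case
P50  <- apply(forecasts, 2, median)
P10  <- apply(forecasts, 2, quantile, 0.9)  # high case
Mean <- colMeans(forecasts)
P50  # per-timestep medians: 200 150 100 50 25
```

Note the petroleum convention used throughout the document: P10 is the high outcome (90th percentile of rate) and P90 the low outcome.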
# Let's consolidate the accepted forecasts using this function
oil.consolidated <- consolidate.rates(oil.forecast,forecast.x = t.forecast + 3, q.abd = o.abd,
trim = "Y")
# Plot results
oil.plot.consolidated <- ggplot.forecast.consolidated(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.forecast,
forecast.x = t.forecast + 3,
scales = "semi-log",
consolidation = oil.consolidated)
# Add titles and output as interactive
oil.plot.consolidated <- oil.plot.consolidated +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)") +
xlim(0,300)
ggplotly(oil.plot.consolidated)

Let's compare these P10, Mean, P50, and P90 curves, built by aggregating individual rates at each time interval, against curves built by aggregating representative forecasts matched on 30-year cumulative volumes. For the latter, the decline parameters inside +/- 0.5 percentile neighbourhoods around each target were averaged to produce a single modified hyperbolic curve for each probabilistic outcome.
(A comparison like this should be re-created after the MCMC algorithm to demonstrate why MCMC is needed; the aggregation above is the one used for the type-well methodology. One caution as the forecast values move from including historical data to excluding it: the whole curve must be synthesized first and then cut off at forecast.x, since qi, di, and b are tied to one another in the model.)

Both charts show similar results. We can see the P10, P50, and P90 forecasts in dashed lines, overlaid with the mean in solid black. While it looked like our forecasts represented the actuals and an appropriate forecast distribution, our P50 and Mean appear to slightly over-estimate the late-time actuals. Let's investigate methods of improving this distribution, namely Markov Chain Monte Carlo (MCMC) methods.
MCMC improves on plain Monte Carlo by concentrating samples where the posterior probability is high rather than sampling the prior blindly. We'll use the Metropolis-Hastings algorithm: starting from a current parameter set, propose a new set from a jumping distribution centred on the current one, then accept or reject the proposal based on the ratio of its posterior probability to the current one. The original Monte Carlo run can serve as burn-in.

To proceed through the Metropolis-Hastings algorithm, we need to define jumping distributions and use them to propose the next test parameter set.
# Define mcmc.eval function to evaluate sampled jump for acceptance probability.
# Returns x_i+1 if accepted, else, xi
# eps_min input used to pass eps_mean_min and eps_std_min of bestfit for pdca_loss function
mcmc.eval <- function(currentIteration = c("mcIteration object defining x_i"),
sampledParms = c("dataframe of next iteration parameters"),
loss_function = c("loss function to evaluate acceptance"),
synth_function = c("synthesis function"),
actuals = c("data frame with actuals"),
actual.t = c("time to synthesize"),
q.abd = c("abandonment rate"),
eps_min = c(0,0)){
# Synthesize data using sampledParms and synth_function
proposed <- synth_function(sampledParms,q.abd,actual.t)
# Call applyLoss function to evaluate the sampled parameters
proposed <- applyLoss(proposed,actuals,loss_function,log.residuals = TRUE,eps_min)
# Evaluate acceptance probability
A <- exp(-1/2*proposed$Cost) / exp(-1/2*currentIteration$Cost)
# Evaluate acceptance
if(runif(1)<A){
output <- proposed
} else{
output <- currentIteration
}
# Return proposed mcIteration object and acceptance probability
return(output)
}
# Define MH.mcmc to perform the MCMC Metropolis-Hastings algorithm for the modified hyperbolic model
# Returns list "chain" of accepted mcIteration objects.
MH.mcmc <- function(iterations = 10000,
                    initial = c("starting mcIteration object"),
                    loss_function = c("loss function to evaluate acceptance"),
                    jump = c("list of jumping distributions"),
                    bounds = c("list of bounding distributions"),
                    actuals = c("actual rates"),
                    actual.t = c("actual time"), q.abd = c("abandonment rate")){
  # Initialize chain
  chain <- vector("list", iterations)
  # Assign initial mcIteration object to chain
  chain[[1]] <- initial
  # Calculate eps_mean_min and eps_std_min from initial as the best fit for the pdca loss
  r <- log(actuals) - log(initial$Synth)
  eps_min <- c(mean(r, na.rm = TRUE), sd(r, na.rm = TRUE))
  # Define bounds for each parameter
  bound <- lapply(bounds, function(x){
    if(x@type == "Uniform"){
      c(x@min_mean, x@max_sd)
    } else if(x@type == "Normal"){
      c(qnorm(0.01, x@min_mean, x@max_sd), qnorm(0.99, x@min_mean, x@max_sd))
    } else {
      c(qlnorm(0.01, x@min_mean, x@max_sd), qlnorm(0.99, x@min_mean, x@max_sd))
    }
  })
  # Perform the MCMC Metropolis-Hastings algorithm across the number of iterations
  for (i in c(2:iterations)){
    # Centre the jumping distributions on the previous iteration
    # Use ln values for qi and di
    jump[[1]]@min_mean <- log(chain[[i-1]]$Parameters[[1]])
    jump[[2]]@min_mean <- chain[[i-1]]$Parameters[[2]]
    jump[[3]]@min_mean <- log(chain[[i-1]]$Parameters[[3]])
    jump[[4]]@min_mean <- chain[[i-1]]$Parameters[[4]]
    # Sample model parameters using the jumping distributions
    sampledParms <- apply(MHSample(1, jump[[1]], jump[[2]], jump[[3]], jump[[4]]), 2,
                          function(x){x})
    # Revert log parameters
    sampledParms[c(1,3)] <- exp(sampledParms[c(1,3)])
    # Compare sampled parameters to bounding distributions
    if(sampledParms[["qi"]] <= bound[[1]][2] & sampledParms[["qi"]] >= bound[[1]][1] &
       sampledParms[["b"]] <= bound[[2]][2] & sampledParms[["b"]] >= bound[[2]][1] &
       sampledParms[["di"]] <= bound[[3]][2] & sampledParms[["di"]] >= bound[[3]][1] &
       sampledParms[["dlim"]] <= bound[[4]][2] & sampledParms[["dlim"]] >= bound[[4]][1]){
      # Call mcmc.eval to evaluate the sampled parameters and append to chain
      # mcmc.eval will return x_(i+1) if accepted or x_i if not
      chain[[i]] <- mcmc.eval(chain[[i-1]], sampledParms, loss_function, MH.synthesize,
                              actuals, actual.t, q.abd, eps_min)
    } else{
      # Reject parameters and keep the previous iteration
      chain[[i]] <- chain[[i-1]]
    }
  }
  # Return chain with MCMC samples
  return(chain)
}
# Using the best fit from the initial Monte Carlo calls, perform the Metropolis-Hastings algorithm
# Define best fit
# Combine costs from oil.forecast
# Use lapply and do.call to combine each object loss into a single dataframe
bestfit.cost <- do.call(rbind,lapply(oil.forecast,function(x){
return(x$Cost)
}))
# Find min cost as best fit
bestfit <- oil.forecast[[which.min(bestfit.cost)]]
# Define jumping distributions as DefineDistribution objects
qi_jump <- new("DefineDistribution",type = "Normal",min_mean = 0, max_sd = 0.2)
b_jump <- new("DefineDistribution",type = "Normal",min_mean = 0, max_sd = 0.2)
di_jump <- new("DefineDistribution",type = "Normal",min_mean = 0, max_sd = 0.4)
dlim_jump <- new("DefineDistribution",type = "Uniform",min_mean = o.dlim, max_sd = o.dlim)
# Combine into single list
jump <- list(qi_jump,b_jump,di_jump,dlim_jump)
# Combine bounding distributions into bound
bounds <- list(qi,b,di,dlim)
# Call MH.mcmc to perform MCMC evaluation
oil.mcmc <- MH.mcmc(iterations = 10000, initial = bestfit,
loss_function = pdca_loss, jump = jump, bounds = bounds,
actuals = singleWell[["Oil.Rate"]], actual.t = synth.time, q.abd = o.abd)
# Need any sort of "kick away from these bounds? Or let it iterate until it gets away itself?
# Look at changing the jumping distributions as well
# Add Di !< 1 error handler in MH function as well?
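One way to judge whether the jumping distributions need widening or tightening is the chain's acceptance rate. Below is a sketch of such a diagnostic; `mcmc.acceptance` is a hypothetical helper, and it assumes each chain element carries a `$Cost` as returned by `mcmc.eval` above (an accepted proposal with an identical cost would be miscounted as a rejection, which is rare enough to ignore for tuning purposes).

```r
# Rough acceptance-rate diagnostic for tuning the jumping distributions.
# Rule of thumb: target very roughly 20-50% acceptance; widen the jumps
# if the rate is much higher, tighten them if much lower.
mcmc.acceptance <- function(chain) {
  costs <- sapply(chain, function(x) x$Cost)
  # A repeated cost means the proposal was rejected and x_i = x_(i-1)
  mean(costs[-1] != costs[-length(costs)])
}
# e.g. mcmc.acceptance(oil.mcmc)
```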
oil.mcmc.plot <- ggplot.multimodel.costcolor(actuals.x = singleWell[["CD.Month"]],
actuals.y = singleWell[["Oil.Rate"]],
iterationObject = oil.mcmc,
scales = "semi-log")
# Add titles and output
oil.mcmc.plot <- oil.mcmc.plot +
xlab("Cal-Day Month") +
ylab("Oil Rate (bbls/d)") +
ggtitle("PDCA Loss MCMC Output")
# Plot Monte Carlo PDCA plot
oil.plot.pdcaP1
# Add mcmc plot to output
oil.mcmc.plot

Comparing Top 10% Monte Carlo Matches using PDCA Loss to MCMC PDCA Loss Output
We can see that although the MCMC output may require some tuning, it returns 10,000 matches that describe the actuals very well in a probabilistic manner, with a tighter spread around the actuals and many more forecasts available for probabilistic summaries. Let's look at the distributions of the matched parameters for improvements over the top 100 matches from the initial Monte Carlo run.
Accepted match parameters using MCMC method and PDCA Loss
The resulting distributions depend on the jumping distributions, which need to be tuned appropriately, but the accepted parameters follow recognizable normal and lognormal shapes. How do the probabilistic forecast outcomes look for this MCMC run?
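One way to summarize probabilistic outcomes directly from the chain is to take quantiles of the accepted parameters. The sketch below assumes each chain element's `$Parameters` is ordered qi, b, di, dlim as in `MH.mcmc` above; `chain.quantiles` is a hypothetical helper name.

```r
# Sketch: P10/P50/P90 summaries of the accepted decline parameters
chain.quantiles <- function(chain, probs = c(0.10, 0.50, 0.90)) {
  # Stack each accepted iteration's parameters into a matrix
  parms <- do.call(rbind, lapply(chain, function(x) unlist(x$Parameters)))
  # Column-wise quantiles across the chain
  apply(parms, 2, quantile, probs = probs, na.rm = TRUE)
}
# e.g. chain.quantiles(oil.mcmc)
```

Note these are percentiles of the parameters themselves; percentiles of cumulative volume, as used earlier in the consolidated plot, would instead be computed on each iteration's synthesized forecast.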
Darcy Redpath
Petronas Canada
dredpath@petronascanada.com